Skip to content

Add async overloads to multi-node TestConductor APIs#7750

Merged
Aaronontheweb merged 30 commits into
akkadotnet:devfrom
Aaronontheweb:mntr-async-overloads
Aug 13, 2025
Merged

Add async overloads to multi-node TestConductor APIs#7750
Aaronontheweb merged 30 commits into
akkadotnet:devfrom
Aaronontheweb:mntr-async-overloads

Conversation

@Aaronontheweb
Copy link
Copy Markdown
Member

Summary

Addresses GitHub issue #4146 (open for 5+ years) by adding async versions of all blocking TestConductor methods to eliminate thread pool starvation and timeout issues in multi-node tests.

Changes

  • Added async overloads with CancellationToken support to all TestConductor methods:
    • EnterAsync(), EnterBarrierAsync()
    • ExitAsync(), BlackholeAsync(), PassThroughAsync()
    • GetAddressForAsync(), GetNodesAsync(), RemoveNodeAsync()
    • NodeAsync(), ThrottleAsync()
  • Implemented sync-over-async pattern for backward compatibility - existing synchronous methods now delegate to async versions
  • Migrated test cases to demonstrate usage:
    • TestConductorSpec - validates the new async APIs work correctly
    • RemoteNodeDeathWatchSpec - shows full async migration pattern

Why This Matters

Based on investigation, the lack of async TestConductor methods is the root cause of:

  • 20+ second timeout failures in multi-node tests
  • Thread pool starvation under CI load
  • Deadlocks in barrier synchronization
  • Flaky multi-node tests across the entire Akka.NET ecosystem

Compatibility

  • ✅ 100% backward compatible - no breaking changes
  • ✅ All existing tests continue to work unchanged
  • ✅ Tests pass on both migrated async and original sync versions

Next Steps

  • Gradually migrate remaining multi-node tests to use async APIs
  • Update MNTR (Multi-Node Test Runner) to remove blocking calls
  • Create migration guide for users

Fixes #4146

Addresses GitHub issue akkadotnet#4146 (open for 5+ years) by adding async versions
of all blocking TestConductor methods to eliminate thread pool starvation
and timeout issues in multi-node tests.

Changes:
- Added EnterAsync() methods with CancellationToken support
- Added ExitAsync(), BlackholeAsync(), PassThroughAsync() methods
- Added GetAddressForAsync(), GetNodesAsync(), RemoveNodeAsync() methods
- Added EnterBarrierAsync() and NodeAsync() to MultiNodeSpec
- Implemented sync-over-async pattern for backward compatibility
- All existing synchronous methods now delegate to async versions

This maintains 100% backward compatibility while providing async alternatives
that will eliminate the 20+ second timeout failures in CI/CD pipelines.
Migrated existing TestConductorSpec test to use the new async methods:
- Changed test method to async Task
- Replaced Throttle().Wait() with ThrottleAsync()
- Used RunOnAsync for async operations

This validates the new async APIs work correctly and provides an example
of how to migrate existing multi-node tests.
These files are not needed for the project and were accidentally included.
Successfully migrated two test methods to be fully async:
- RemoteNodeDeathWatch_must_receive_Terminated_when_watched_node_crashAsync
- RemoteNodeDeathWatch_must_cleanup_when_watching_node_crashAsync

Changes:
- Converted test methods to async Task
- Replaced TestConductor.Exit().Wait() with ExitAsync()
- Used RunOnAsync for all async operations
- Replaced EnterBarrier with EnterBarrierAsync
- Added System.Threading.Tasks using directive

All 6 test node variations pass successfully, demonstrating the async
APIs work correctly in real multi-node test scenarios.
Replaced .Wait() calls with .GetAwaiter().GetResult() to avoid potential
deadlocks while maintaining synchronous execution model. This is a
transitional step before fully converting StressSpec to async.

Changes:
- TestConductor.Exit().Wait() -> GetAwaiter().GetResult()
- TestConductor.Blackhole().Wait() -> GetAwaiter().GetResult()

This eliminates the blocking wait calls that were causing thread pool
starvation in CI environments.
StressSpec contains deeply nested synchronous test structures (Within,
RunOn, ReportResult) that make it impossible to properly await async
TestConductor methods without a complete rewrite.

The current TestConductor.Exit() and TestConductor.Blackhole() methods
now internally use the async versions, which is an improvement over
the previous direct blocking calls, but a full async conversion of
StressSpec would require:

1. Converting Within() to support async operations
2. Converting ReportResult() to be async
3. Converting all test orchestration methods to async
4. Updating all callers throughout the test

This is tracked as future work.
- Convert main test method Cluster_under_stress to async Task
- Convert all Must* helper methods to async Task
- Fix ReportResult lambda expressions to be async
- Use WithinAsync instead of Within for async operations
- Replace all TestConductor.Exit().Wait() with await TestConductor.ExitAsync()
- Use RunOnAsync for async operations
- Replace EnterBarrier with EnterBarrierAsync calls
- Convert main test method LeaderElectionSpecs to async Task
- Convert ShutdownLeaderAndVerifyNewLeader to async Task
- Replace TestConductor.Exit().Wait() with await TestConductor.ExitAsync()
- Convert all Cluster_of_four_nodes_* methods to async Task
- Use WithinAsync instead of Within for async operations
- Replace all EnterBarrier calls with EnterBarrierAsync
- Add using System.Threading.Tasks
- Convert main test method to async Task
- Convert all test helper methods to async Task
- Replace TestConductor blocking calls with async versions:
  - Blackhole().Wait() -> BlackholeAsync()
  - PassThrough().Wait() -> PassThroughAsync()
  - Exit().Wait() -> ExitAsync()
- Replace all EnterBarrier calls with EnterBarrierAsync
- Use RunOnAsync for async operations
- Add using System.Threading.Tasks
- Create comprehensive migration guide (MULTINODE_TEST_ASYNC_MIGRATION.md)
  - Documents all blocking patterns to replace
  - Provides before/after code examples
  - Lists common pitfalls to avoid
  - Includes complete checklist of 34 test files needing migration

- Add migration status checker script (check-multinode-migration.sh)
  - Automatically finds tests with blocking TestConductor calls
  - Counts .Wait() calls per file
  - Identifies tests using synchronous EnterBarrier
  - Tracks tests using Within that may need WithinAsync
  - Provides color-coded status output

Current status:
- 34 test files still have blocking TestConductor calls
- 116 files still use synchronous EnterBarrier
- 78 files use Within blocks that may need async conversion
- 5 tests already migrated successfully

This tooling helps track and manage the async migration effort across
all 157 multi-node test files in the codebase.
- Convert test methods to async Task
- Replace EnterBarrier with EnterBarrierAsync
- Use WithinAsync for async timing constraints
- Use RunOnAsync for async operations
- Convert WhenTerminated.Wait() to WaitAsync()
- Keep TestConductor.Shutdown().Wait() as no async version exists
- Add using System.Threading.Tasks

This eliminates race conditions caused by blocking calls and thread pool starvation.
@Aaronontheweb Aaronontheweb added akka-testkit Akka.NET Testkit issues multi node spec labels Aug 12, 2025
Copy link
Copy Markdown
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions

Comment thread src/core/Akka.Cluster.Tests.MultiNode/StressSpec.cs Outdated
Comment thread src/core/Akka.Cluster.Tests.MultiNode/LeaderElectionSpec.cs Outdated
Comment thread src/core/Akka.Remote.TestKit/MultiNodeSpec.cs Outdated
Comment thread src/core/Akka.Remote.TestKit/Player.cs Outdated
Comment thread src/core/Akka.Remote.Tests.MultiNode/TestConductor/TestConductorSpec.cs Outdated
@Arkatufus
Copy link
Copy Markdown
Contributor

Waiting to see if everything turned green for this PR

Copy link
Copy Markdown
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing my "request for change" since I've modified the PR.
Would love to have another pair of eyes to check this PR.

Copy link
Copy Markdown
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found some no-nos

"controller");
"controller");

var node = await _controller.Ask<IPEndPoint>(TestKit.Controller.GetSockAddr.Instance, Settings.QueryTimeout).ConfigureAwait(false);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

{
try
{
var result = await Controller.Ask(new Terminate(node, new Right<bool, int>(exitValue)), Settings.QueryTimeout, cancellationToken).ConfigureAwait(false);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remove all ConfigureAwait(false) from here

This was referenced May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

akka-testkit Akka.NET Testkit issues multi node spec

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add async overloads to MultiNode test APIs, and update existing tests

2 participants